

 Northern North Sea


How good are LLMs at Retrieving Documents in a Specific Domain?

Islam, Nafis Tanveer, Zhao, Zhiming

arXiv.org Artificial Intelligence

Classical search engines in data infrastructures rely on indexing methods and primarily support keyword-based queries for retrieving content. While these indexing-based methods are highly scalable and efficient, they have a limited understanding of semantics and lack an appropriate evaluation dataset, so they often fail to capture the user's intent and return incomplete responses during evaluation. This problem also extends to domain-specific search systems that use a Knowledge Base (KB) to access data from various research infrastructures. Research infrastructures (RIs) in the environmental and earth science domain, which encompasses the study of ecosystems, biodiversity, oceanography, and climate change, generate, share, and reuse large volumes of data. Attempts to provide a centralized search service using Elasticsearch as a knowledge base face similar challenges in understanding queries with multiple intents. To address these challenges, we propose an automated method to curate a domain-specific evaluation dataset for analyzing the capability of a search system. Furthermore, we incorporate Retrieval-Augmented Generation (RAG), powered by Large Language Models (LLMs), for high-quality retrieval of environmental domain data using natural language queries. Our quantitative and qualitative analysis on the evaluation dataset shows that LLM-based information retrieval systems return results with higher precision than Elasticsearch-based systems for queries with multiple intents.
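
The abstract compares systems by retrieval precision but does not say which variant of the metric is used. As a minimal sketch, assuming precision at a cutoff k (precision@k) and hypothetical document IDs, the comparison for a single query could be computed as follows. The function name and the example data are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant.

    `retrieved` is a ranked list of document IDs returned by a search system;
    `relevant` is the set of IDs judged relevant for the query (hypothetical
    ground truth, e.g. from a curated evaluation dataset).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Dividing by k (not len(top_k)) penalizes systems that return fewer than k results.
    return hits / k


# Hypothetical ranked results from two systems for one multi-intent query.
relevant_docs = {"d1", "d3", "d4"}
keyword_results = ["d7", "d1", "d9", "d8"]
llm_results = ["d1", "d3", "d9", "d4"]

keyword_score = precision_at_k(keyword_results, relevant_docs, k=4)
llm_score = precision_at_k(llm_results, relevant_docs, k=4)
```

Averaging this score over all queries in an evaluation dataset gives one number per system, which is one common way to make the kind of comparison the abstract describes.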


Modelling non-stationary extremal dependence through a geometric approach

Murphy-Barltrop, C. J. R., Wadsworth, J. L., de Carvalho, M., Youngman, B. D.

arXiv.org Machine Learning

Non-stationary extremal dependence, whereby the relationship between the extremes of multiple variables evolves over time, is commonly observed in many environmental and financial data sets. However, most multivariate extreme value models are only suited to stationary data. A recent approach to multivariate extreme value modelling uses a geometric framework, whereby extremal dependence features are inferred through the limiting shapes of scaled sample clouds. This framework can capture a wide range of dependence structures, and a variety of inference procedures have been proposed in the stationary setting. In this work, we first extend the geometric framework to the non-stationary setting and outline assumptions to ensure the necessary convergence conditions hold. We then introduce a flexible, semi-parametric modelling framework for obtaining estimates of limit sets in the non-stationary setting. Through rigorous simulation studies, we demonstrate that our proposed framework can capture a wide range of dependence forms and is robust to different model formulations. We illustrate the proposed methods on financial returns data and present several practical uses.
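
To make the idea of a "limiting shape of a scaled sample cloud" concrete, the sketch below covers only the simplest stationary case, not the non-stationary model of the paper. It assumes two independent variables with standard exponential margins and scaling by log n; under those assumptions the scaled cloud is known to concentrate on the triangle {x, y ≥ 0 : x + y ≤ 1}. The function names are hypothetical.

```python
import numpy as np


def scaled_sample_cloud(n: int, seed: int = 0) -> np.ndarray:
    """Draw n independent pairs with standard exponential margins, scaled by log(n).

    Independence and exponential margins are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    sample = rng.exponential(scale=1.0, size=(n, 2))
    return sample / np.log(n)


def fraction_inside_triangle(cloud: np.ndarray) -> float:
    """Share of scaled points lying in the triangle x + y <= 1 (x, y >= 0)."""
    return float(np.mean(cloud.sum(axis=1) <= 1.0))


cloud = scaled_sample_cloud(100_000)
inside = fraction_inside_triangle(cloud)  # close to 1 for large n
```

Under dependence the boundary of the limit set changes shape, and that shape is the quantity the geometric framework estimates. In the non-stationary setting of the abstract it is additionally allowed to vary with time or covariates.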